- tidyverse I: dplyr; gapminder
- tidyverse II: readr, ggplot2; Public Data, WDI, WIR, etc.
- tidyverse III: tidyr, etc.; WDI, WIR, etc.
- tidyverse IV: WDI, WIR, etc.
- bookdown site: https://bookdown.org
- coursera courses, learnr

Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
R is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing.
The R language is widely used among statisticians and data miners for developing statistical software and data analysis.
A GNU package, the official R software environment is written primarily in C, Fortran, and R itself (thus, it is partially self-hosting) and is freely available under the GNU General Public License.
“R Programming for Data Science” by Roger Peng
When you talk about choosing programming languages, I always say you shouldn’t pick them based on technical merits, but rather pick them based on the community. And I think the R community is like really, really strong, vibrant, free, welcoming, and embraces a wide range of domains. So, if there are like people like you using R, then your life is going to be much easier. That’s the first reason.
Interview: “Advice to Young (and Old) Programmers, H. Wickham”
RStudio is an integrated development environment, or IDE, for R programming.
RStudio is an integrated development environment (IDE) for R, a programming language for statistical computing and graphics. It is available in two formats: RStudio Desktop is a regular desktop application while RStudio Server runs on a remote server and allows accessing RStudio using a web browser.
To download R, go to CRAN, the comprehensive R archive network. CRAN is composed of a set of mirror servers distributed around the world and is used to distribute R and R packages. Don’t try and pick a mirror that’s close to you: instead use the cloud mirror, https://cloud.r-project.org, which automatically figures it out for you.
A new major version of R comes out once a year, and there are 2-3 minor releases each year. It’s a good idea to update regularly.
Download and install it from http://www.rstudio.com/download.
RStudio is updated a couple of times a year. When a new version is available, RStudio will let you know.
In this way, the working directory of the session is set to the project directory, and R can find related files without difficulty (getwd(), setwd()).
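As a quick check (the path in the comment below is hypothetical):

```r
getwd()                    # print the current working directory
# setwd("~/projects/ds1")  # set it manually if needed (hypothetical path)
```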
RStudio Cloud is a lightweight, cloud-based solution that allows anyone to do, share, teach and learn data science online.
Start RStudio and create a project, or login to Posit Cloud and create a project.
Enter the following code into the Console in the bottom-left pane.
head(cars)
str(cars)
'data.frame': 50 obs. of 2 variables:
$ speed: num 4 4 7 7 8 9 10 10 10 11 ...
$ dist : num 2 10 4 22 16 10 18 26 34 17 ...
summary(cars)
speed dist
Min. : 4.0 Min. : 2.00
1st Qu.:12.0 1st Qu.: 26.00
Median :15.0 Median : 36.00
Mean :15.4 Mean : 42.98
3rd Qu.:19.0 3rd Qu.: 56.00
Max. :25.0 Max. :120.00
plot(cars)
plot(cars) # cars: Speed and Stopping Distances of Cars
abline(lm(cars$dist~cars$speed))
lm(cars$dist~cars$speed)
Call:
lm(formula = cars$dist ~ cars$speed)
Coefficients:
(Intercept) cars$speed
-17.579 3.932
summary(lm(cars$dist~cars$speed))
Call:
lm(formula = cars$dist ~ cars$speed)
Residuals:
Min 1Q Median 3Q Max
-29.069 -9.525 -2.272 9.215 43.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.601 0.0123 *
cars$speed 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
- head(cars): the first 6 rows of the pre-installed data cars.
- str(cars): the data structure of the pre-installed data cars.
- summary(cars): the summary of the pre-installed data cars.
- plot(cars): a scatter plot of the pre-installed data cars.
- plot(cars$dist ~ cars$speed): cars$dist, cars[[2]], and cars[,2] are the same.
- abline(lm(cars$dist ~ cars$speed)): add the regression line of a linear model.
- lm(cars$dist ~ cars$speed): the equation of the regression line.
- summary(lm(cars$dist ~ cars$speed)): the summary of the linear regression model.
- hist(cars$dist), hist(cars$speed): histograms of each variable.
- View(cars): open the data viewer.
- ?cars: same as help(cars).
- ??cars: same as help.search("cars").
- ?datasets and library(help = "datasets"): information about the datasets package.
- data(): shows all data already attached and available.

Pick a dataset in the datasets package and try head(), str(), summary(), and some more.
iris

head(iris)
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Can you plot?
plot(iris$Sepal.Length, iris$Sepal.Width)
tidyverse Packages

Sys.setenv(LANG = "en")
dir.create("data")

Scripts: basics.R, coronavirus.R. To run the code at the cursor, press Ctrl+Shift+Enter (Win) or Cmd+Shift+Enter (Mac).
R packages are extensions to the R statistical programming language. R packages contain code, data, and documentation in a standardised collection format that can be installed by users of R, typically via a centralised software repository such as CRAN (the Comprehensive R Archive Network).
You can install packages via “Install Packages…” under “Tools” in the top menu.
install.packages("tidyverse")
install.packages("rmarkdown")

Choose R Notebook from the File pull-down menu in the top bar.
The default is as follows:
---
title: "R Notebook"
output: html_notebook
---
Template
---
title: "Title of R Notebook"
author: "ID and Your Name"
date: "2023-01-20"
output:
  html_notebook:
    # number_sections: yes
    # toc: true
    # toc_float: true
---
- number_sections: yes - the default is number_sections: no.
- toc: true - the default is toc: false.
- toc_float: true - the default is toc_float: false.

Insert Chunk from the Code pull-down menu in the top bar, or use the C button on top. You can use the shortcut keys listed under Tools in the top bar.
library(tidyverse)
Let us assign the iris data in the pre-installed package datasets to df_iris. You can give any name starting with a letter, though there are some rules.
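A small sketch of the naming rules (the names here are made up):

```r
my_data_1  <- 10   # OK: starts with a letter; letters, digits, '_' and '.' allowed
sepal.mean <- 5.8  # OK: '.' is allowed
# 1st_value <- 10  # error: a name cannot start with a digit
# my-data   <- 10  # error: '-' is not allowed in a name
```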
df_iris <- datasets::iris
class(df_iris)
[1] "data.frame"
The class of data iris is data.frame, the
basic data class of R. You can assign the same data as a
tibble, the data class of tidyverse as
follows.
tbl_iris <- as_tibble(datasets::iris)
class(tbl_iris)
[1] "tbl_df" "tbl" "data.frame"
df_iris <- iris can replace df_iris <- datasets::iris because the package datasets is installed and attached by default. Since you may have other data called iris included in a different package, or you may have changed iris before, it is safer to specify the name of the package along with the name of the data.

df_iris and tbl_iris behave differently. It is because of the default settings of R Markdown.

The View command opens a window to show the contents of the data, and you can use the filter as well.
View(df_iris)
The following simple command also shows the data.
df_iris
The output within an R Notebook is in tibble style. Try the same command in the Console.
slice(df_iris, 1:10)
1:10
[1] 1 2 3 4 5 6 7 8 9 10
Let us look at the structure of the data. You can try str(df_iris) in the Console or by adding a code chunk in the R Notebook, introduced later.
glimpse(df_iris)
Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8…
$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3.0, 3.0, 4.0…
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4, 1.1, 1.2…
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.2, 0.1, 0.1, 0.2…
$ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
There are six types of data in R: double, integer, character, logical, raw, and complex.
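You can check each type with typeof(); complex and raw rarely appear in everyday data analysis:

```r
typeof(1.5)          # "double"
typeof(2L)           # "integer"
typeof("a")          # "character"
typeof(TRUE)         # "logical"
typeof(as.raw(255))  # "raw"
typeof(1 + 2i)       # "complex"
```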
The names after $ are column names. If you call df_iris$Species, you get the Species column. Since Species is in the 5th column, typeof(df_iris[[5]]) does the same as the next.

df_iris[2,4] = 0.2 is the entry in the second row and the fourth column (Petal.Width).
typeof(df_iris$Species)
[1] "integer"
class(df_iris$Species)
[1] "factor"
For factors (fct), see the R documentation or the explanation in Factor in R: Categorical Variable & Continuous Variables.
typeof(df_iris$Sepal.Length)
[1] "double"
class(df_iris$Sepal.Length)
[1] "numeric"
Q1. What are the differences of df_iris, slice(df_iris, 1:10), and glimpse(df_iris) above?

Q2. What are the differences of df_iris, slice(df_iris, 1:10), and glimpse(df_iris) in the console?
The following is very convenient for getting the summary information of a dataset.
summary(df_iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Minimum, 1st Quartile (25%), Median, Mean, 3rd Quartile (75%), Maximum, and the count of each factor.
We use ggplot2 to draw graphs. A scatter plot displays data with two variables \(x\) and \(y\).
ggplot(data = <data>, aes(x = <column name for x>, y = <column name for y>)) +
geom_point()
ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point()
Add a title and labels by adding labs().
ggplot(data = <data>, aes(x = <column name for x>, y = <column name for y>)) +
geom_point() +
labs(title = "Title", x = "Label for x", y = "Label for y")
ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point() +
labs(title = "Scatter Plot of Sepal Data of Iris", x = "Sepal Length", y = "Sepal Width")
Different colors are automatically assigned to each species. Can you see each group?
ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point()
ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width, shape = Species)) +
geom_point()
The boxplot compactly displays the distribution of a continuous variable.
ggplot(data = df_iris, aes(x = Species, y = Sepal.Length)) +
geom_boxplot()
Visualize the distribution of a single continuous variable by dividing the x axis into bins and counting the number of observations in each bin. Histograms (geom_histogram()) display the counts with bars.
ggplot(data = df_iris, aes(x = Sepal.Length)) +
geom_histogram()
Change the number of bins by bins =
<number>.
ggplot(data = df_iris, aes(x = Sepal.Length)) +
geom_histogram(bins = 10)
Professor Kaizoji will cover the mathematical models and hypothesis testing.
ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
swirl website: https://swirlstats.com. We use swirl for exercises. You can install other swirl courses as well with install_course("Course Name Here").

install.packages("swirl") # Only the first time
library(swirl)            # Every time you start swirl
swirl()                   # Every time you start or resume swirl
1: Basic Building Blocks 2: Workspace and Files 3: Sequences of Numbers
4: Vectors 5: Missing Values 6: Subsetting Vectors
7: Matrices and Data Frames 8: Logic 9: Functions
10: lapply and sapply 11: vapply and tapply 12: Looking at Data
13: Simulation 14: Dates and Times 15: Base Graphics
1, 3, 4, 5, 6, 7, 12, 15, 14, 8, 9, 10, 11, 13, 2
swirl Session… <– That’s your cue to press Enter to continue
You can exit swirl and return to the R prompt (>) at any time by pressing the Esc key.
If you are already at the prompt, type bye() to exit and save your progress. When you exit properly, you’ll see a short message letting you know you’ve done so.
When you are at the R prompt (>):
You will encounter a message like ‘Would you like to receive credit for completing this course on Coursera.org?’ at the end of each course. This is for Coursera courses. Select ‘No’.
basics.R: the script with the outputs.
#################
#
# basics.R
#
################
# 'Quick R' by DataCamp may be a handy reference:
# https://www.statmethods.net/management/index.html
# Cheat Sheet at RStudio: https://www.rstudio.com/resources/cheatsheets/
# Base R Cheat Sheet: https://github.com/rstudio/cheatsheets/raw/main/base-r.pdf
# To execute the line: Control + Enter (Window and Linux), Command + Enter (Mac)
## try your experiments on the console
## calculator
3 + 7
### +, -, *, /, ^ (or **), %%, %/%
3 + 10 / 2
3^2
2^3
2*2*2
### assignment: <-, (=, ->, assign())
x <- 5
x
#### object_name <- value, '<-' shortcut: Alt (option) + '-' (hyphen or minus)
#### Object names must start with a letter and can only contain letters, numbers, _ and .
this_is_a_long_name <- 5^3
this_is_a_long_name
char_name <- "What is your name?"
char_name
#### Use 'tab completion' and 'up arrow'
### ls(): list of all assignments
ls()
ls.str()
#### check Environment in the upper right pane
### (atomic) vectors
5:10
a <- seq(5,10)
a
b <- 5:10
identical(a,b)
seq(5,10,2) # same as seq(from = 5, to = 10, by = 2)
c1 <- seq(0,100, by = 10)
c2 <- seq(0,100, length.out = 10)
c1
c2
length(c1)
#### ? seq ? length ? identical
(die <- 1:6)
zero_one <- c(0,1) # same as 0:1
die + zero_one # c(1,2,3,4,5,6) + c(0,1). re-use
d1 <- rep(1:3,2) # repeat
d1
die == d1
d2 <- as.character(die == d1)
d2
d3 <- as.numeric(die == d1)
d3
### class() for class and typeof() for mode
### class of vectors: numeric, character, logical
### types of vectors: doubles, integers, characters, logicals (complex and raw)
typeof(d1); class(d1)
typeof(d2); class(d2)
typeof(d3); class(d3)
sqrt(2)
sqrt(2)^2
sqrt(2)^2 - 2
typeof(sqrt(2))
typeof(2)
typeof(2L)
5 == c(5)
length(5)
### Subsetting
(A_Z <- LETTERS)
A_F <- A_Z[1:6]
A_F
A_F[3]
A_F[c(3,5)]
large <- die > 3
large
even <- die %in% c(2,4,6)
even
A_F[large]
A_F[even]
A_F[die < 4]
### Compare df with df1 <- data.frame(number = die, alphabet = A_F)
df <- data.frame(number = die, alphabet = A_F, stringsAsFactors = FALSE)
df
df$number
df$alphabet
df[3,2]
df[4,1]
df[1]
class(df[1])
class(df[[1]])
identical(df[[1]], die)
identical(df[1],die)
####################
# The First Example
####################
plot(cars)
# Help
? cars
# cars is in the 'datasets' package
data()
# help(cars) does the same as ? cars
# You can use Help tab in the right bottom pane
help(plot)
? par
head(cars)
str(cars)
summary(cars)
x <- cars$speed
y <- cars$dist
min(x)
mean(x)
quantile(x)
plot(cars)
abline(lm(cars$dist ~ cars$speed))
summary(lm(cars$dist ~ cars$speed))
boxplot(cars)
hist(cars$speed)
hist(cars$dist)
hist(cars$dist, breaks = seq(0,120, 10))
gapminder Package

Hans Rosling was a Swedish physician, academic, and public speaker. He was a professor of international health at Karolinska Institute and was the co-founder and chairman of the Gapminder Foundation, which developed the Trendalyzer software system. (Wikipedia)
recognizing when a decision feels urgent and remembering that it rarely is.
To control the urgency instinct, take small steps.
# install.packages("gapminder")
library(gapminder)
df <- gapminder
df
glimpse(df)
Rows: 1,704
Columns: 6
$ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "…
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Euro…
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002, 2007, 1952…
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.822, 41.674, 41.7…
$ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12881816, 13867957…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, 978.0114, 852.39…
summary(df)
country continent year lifeExp pop
Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60 Min. :6.001e+04
Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20 1st Qu.:2.794e+06
Algeria : 12 Asia :396 Median :1980 Median :60.71 Median :7.024e+06
Angola : 12 Europe :360 Mean :1980 Mean :59.47 Mean :2.960e+07
Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85 3rd Qu.:1.959e+07
Australia : 12 Max. :2007 Max. :82.60 Max. :1.319e+09
(Other) :1632
gdpPercap
Min. : 241.2
1st Qu.: 1202.1
Median : 3531.8
Mean : 7215.3
3rd Qu.: 9325.5
Max. :113523.1
- tidyverse
- rmarkdown
- gapminder
- EDA from r4ds

Today: R Markdown and dplyr
What is R Markdown: https://vimeo.com/178485416
R Markdown provides an authoring framework for data science. You can use a single R Markdown file to both save and execute code, and generate high quality reports that can be shared with an audience.
R Notebooks are an implementation of Literate Programming that allows for direct interaction with R while producing a reproducible document with publication-quality output.
An R Notebook is an R Markdown document with chunks that can be executed independently and interactively, with output visible immediately beneath the input.
(Reference: R Markdown: The Definitive Guide, 3.2 Notebook)
Important: Implementation of Reproducible Research and Literate Programming
Useful to Render into Various Formats: R Notebook (HTML), R Markdown (HTML), PDF, MS Word, MS Powerpoint, Ioslides Presentation (HTML), Slidy Presentation (HTML), Beamer Presentation (PDF), etc.
Literate programming is an approach to programming introduced by Donald Knuth in which a program is given as an explanation of the program logic in a natural language, such as English, interspersed with snippets of macros and traditional source code, from which compilable source code can be generated.
Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.
Reproducible research is the idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them. The need for reproducibility is increasing dramatically as data analyses become more complex, involving larger datasets and more sophisticated computations. Reproducibility allows for people to focus on the actual content of a data analysis, rather than on superficial details reported in a written summary. In addition, reproducibility makes an analysis more useful to others because the data and code that actually conducted the analysis are available.
R Markdown is also important because it so tightly integrates prose and code. This makes it a great analysis notebook because it lets you develop code and record your thoughts. It:
Records what you did and why you did it. Regardless of how great your memory is, if you don’t record what you do, there will come a time when you have forgotten important details. Write them down so you don’t forget!
Supports rigorous thinking. You are more likely to come up with a strong analysis if you record your thoughts as you go, and continue to reflect on them. This also saves you time when you eventually write up your analysis to share with others.
Helps others understand your work. It is rare to do data analysis by yourself, and you’ll often be working as part of a team. A lab notebook helps you share why you did it with your colleagues or lab mates.
rmarkdown:

install.packages("rmarkdown")

tinytex (for PDF generation):

install.packages('tinytex')
tinytex::install_tinytex() # install TinyTeX

You can also run commands in the Terminal in the bottom-left pane.

Try plot(cars) and then Preview again. Template to submit your assignment of this course: RNotebook_Template.nb.html
title: "Title of R Notebook"
author: "ID and Your Name"
date: "2023-01-20"
output:
  html_notebook: null
Various Output Formats: test-rmarkdown.nb.html
title: "Testing R Markdown Formats"
author: "DS-SL"
date: "2023-01-20"
output:
  html_notebook:
    number_sections: yes
  pdf_document:
    number_sections: yes
  html_document:
    df_print: paged
    number_sections: yes
  word_document:
    number_sections: yes
  powerpoint_presentation: default
  ioslides_presentation:
    widescreen: yes
    smaller: yes
  slidy_presentation: default
  beamer_presentation: default
--- is a page break for presentation formats.

Use ref-doc-style.docx as reference_doc in YAML, with indentation as below:

word_document:
  number_sections: yes
  reference_doc: ref-doc-style.docx
powerpoint_presentation:
  reference_doc: ref-ppt-style.pptx

See Output Options at the bottom of the gear icon next to the Preview/Knit button.

Write $\frac{a}{b}$ for \(\frac{a}{b}\), _italic_ for italic text, and **bold** for bold text.

RStudio introduced the Visual Editor towards the end of 2021. It seems to be stable, but it is not perfect when going back and forth with the original source editor using tags. I always use the original editor and I am confident in all of its functions, but I do not have much experience with the Visual Editor. [My Note in QALL401 2021]
dplyr Overview

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

- select() picks variables based on their names.
- filter() picks cases based on their values.
- mutate() adds new variables that are functions of existing variables.
- summarise() reduces multiple values down to a single summary.
- arrange() changes the ordering of the rows.
- group_by() takes an existing tbl and converts it into a grouped tbl.

You can learn more about them in vignette("dplyr"). As well as these single-table verbs, dplyr also provides a variety of two-table verbs, which you can learn about in vignette("two-table").
If you are new to dplyr, the best place to start is the data transformation chapter in R for data science.
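The six verbs above can be chained with the pipe; a minimal sketch using the built-in iris data (the column choices are illustrative):

```r
library(dplyr)

iris %>%
  filter(Species != "setosa") %>%               # filter: pick rows by value
  select(Species, Sepal.Length) %>%             # select: pick columns by name
  mutate(sl_mm = Sepal.Length * 10) %>%         # mutate: add a derived column (cm -> mm)
  group_by(Species) %>%                         # group_by: group the rows
  summarise(mean_sl = mean(Sepal.Length)) %>%   # summarise: one value per group
  arrange(desc(mean_sl))                        # arrange: order the result
```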
select: subset columns using their names and types.

| Helper Function | Use | Example |
|---|---|---|
| - | Columns except | select(babynames, -prop) |
| : | Columns between (inclusive) | select(babynames, year:n) |
| contains() | Columns that contain a string | select(babynames, contains("n")) |
| ends_with() | Columns that end with a string | select(babynames, ends_with("n")) |
| matches() | Columns that match a regex | select(babynames, matches("n")) |
| num_range() | Columns with a numerical suffix in the range | Not applicable with babynames |
| one_of() | Columns whose names appear in the given set | select(babynames, one_of(c("sex", "gender"))) |
| starts_with() | Columns that start with a string | select(babynames, starts_with("n")) |
filter: subset rows using column values.

| Logical operator | Tests | Example |
|---|---|---|
| > | Is x greater than y? | x > y |
| >= | Is x greater than or equal to y? | x >= y |
| < | Is x less than y? | x < y |
| <= | Is x less than or equal to y? | x <= y |
| == | Is x equal to y? | x == y |
| != | Is x not equal to y? | x != y |
| is.na() | Is x an NA? | is.na(x) |
| !is.na() | Is x not an NA? | !is.na(x) |
arrange and Pipe %>%

arrange() orders the rows of a data frame by the values of selected columns. Unlike other dplyr verbs, arrange() largely ignores grouping; you need to explicitly mention grouping variables (or use .by_group = TRUE) in order to group by them, and functions of variables are evaluated once per data frame, not once per group.

See pipes in R for Data Science.

mutate

Create, modify, and delete columns.
Useful mutate functions:

- +, -, log(), etc., for their usual mathematical meanings
- lead(), lag()
- dense_rank(), min_rank(), percent_rank(), row_number(), cume_dist(), ntile()
- cumsum(), cummean(), cummin(), cummax(), cumany(), cumall()
- na_if(), coalesce()

group_by() and summarise() (or summarize())

So far our summarise() examples have relied on sum(), max(), and mean(). But you can use any function in summarise() so long as it meets one criterion: the function must take a vector of values as input and return a single value as output. Functions that do this are known as summary functions, and they are common in the field of descriptive statistics. Some of the most useful summary functions include mean(), median(), sd(), min(), max(), and n().
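Any function that takes a vector and returns a single value works; a minimal sketch with the built-in iris data (the column choices are illustrative):

```r
library(dplyr)

iris %>%
  group_by(Species) %>%
  summarise(
    n       = n(),                # number of rows per group
    mean_sl = mean(Sepal.Length), # arithmetic mean
    sd_sl   = sd(Sepal.Length),   # standard deviation
    max_pw  = max(Petal.Width)    # largest value
  )
```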
dplyr by Examples: iris
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
select 1 - columns 1, 2, 5:
select(iris, c(1,2,5))

select 2 - except Species:
select(iris, -Species)

select 3 - change column names:
select(iris, sl = Sepal.Length, sw = Sepal.Width, sp = Species)

filter - by name:
filter(iris, Species == "virginica")

arrange - ascending and descending order:
arrange(iris, Sepal.Length, desc(Sepal.Width))

mutate - rank:
iris %>% mutate(sl_rank = min_rank(Sepal.Length)) %>% arrange(sl_rank)
group_by and summarize:

iris %>%
group_by(Species) %>%
summarize(sl = mean(Sepal.Length), sw = mean(Sepal.Width),
pl = mean(Petal.Length), pw = mean(Petal.Width))
- mean() or mean(x, na.rm = TRUE) - arithmetic mean (average)
- median() or median(x, na.rm = TRUE) - mid value

For more examples see dplyr.

dplyr by Examples II - gapminder

ggplot2 Overview

ggplot2 is a system for declaratively creating graphics,
based on The Grammar of Graphics.
You provide the data, tell ggplot2 how to map variables to aesthetics,
what graphical primitives to use, and it takes care of the details.
Examples
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = class, y = hwy))
Template
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
gapminder

Gapminder was founded by Ola Rosling, Anna Rosling Rönnlund, and Hans Rosling.
Gapminder: https://www.gapminder.org
R Package gapminder by Jennifer Bryan
Package help: ?gapminder, or type gapminder in the search window of the Help tab.
library(tidyverse)
library(gapminder)
library(WDI)
gapminder data

df <- gapminder
df
glimpse(df)
Rows: 1,704
Columns: 6
$ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "…
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Euro…
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002, 2007, 1952…
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.822, 41.674, 41.7…
$ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12881816, 13867957…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, 978.0114, 852.39…
summary(df)
country continent year lifeExp pop
Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60 Min. :6.001e+04
Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20 1st Qu.:2.794e+06
Algeria : 12 Asia :396 Median :1980 Median :60.71 Median :7.024e+06
Angola : 12 Europe :360 Mean :1980 Mean :59.47 Mean :2.960e+07
Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85 3rd Qu.:1.959e+07
Australia : 12 Max. :2007 Max. :82.60 Max. :1.319e+09
(Other) :1632
gdpPercap
Min. : 241.2
1st Qu.: 1202.1
Median : 3531.8
Mean : 7215.3
3rd Qu.: 9325.5
Max. :113523.1
ggplot(df, aes(x = year, y = lifeExp)) + geom_point()
ggplot(df, aes(x = year, y = lifeExp)) + geom_line()
ggplot(df, aes(x = year, y = lifeExp)) + geom_boxplot()
typeof(pull(df, year)) # same as typeof(df$year)
[1] "integer"
ggplot(df, aes(y = lifeExp, group = year)) + geom_boxplot()
ggplot(df, aes(x = as_factor(year), y = lifeExp)) + geom_boxplot()
dplyr: filter

df %>% filter(country == "Afghanistan") %>%
ggplot(aes(x = year, y = lifeExp)) + geom_line()
df %>% filter(country %in% c("Afghanistan", "Japan")) %>%
ggplot(aes(x = year, y = lifeExp, color = country)) + geom_line()
df %>% distinct(country) %>% pull()
[1] Afghanistan Albania Algeria
[4] Angola Argentina Australia
[7] Austria Bahrain Bangladesh
[10] Belgium Benin Bolivia
[13] Bosnia and Herzegovina Botswana Brazil
[16] Bulgaria Burkina Faso Burundi
[19] Cambodia Cameroon Canada
[22] Central African Republic Chad Chile
[25] China Colombia Comoros
[28] Congo, Dem. Rep. Congo, Rep. Costa Rica
[31] Cote d'Ivoire Croatia Cuba
[34] Czech Republic Denmark Djibouti
[37] Dominican Republic Ecuador Egypt
[40] El Salvador Equatorial Guinea Eritrea
[43] Ethiopia Finland France
[46] Gabon Gambia Germany
[49] Ghana Greece Guatemala
[52] Guinea Guinea-Bissau Haiti
[55] Honduras Hong Kong, China Hungary
[58] Iceland India Indonesia
[61] Iran Iraq Ireland
[64] Israel Italy Jamaica
[67] Japan Jordan Kenya
[70] Korea, Dem. Rep. Korea, Rep. Kuwait
[73] Lebanon Lesotho Liberia
[76] Libya Madagascar Malawi
[79] Malaysia Mali Mauritania
[82] Mauritius Mexico Mongolia
[85] Montenegro Morocco Mozambique
[88] Myanmar Namibia Nepal
[91] Netherlands New Zealand Nicaragua
[94] Niger Nigeria Norway
[97] Oman Pakistan Panama
[100] Paraguay Peru Philippines
[103] Poland Portugal Puerto Rico
[106] Reunion Romania Rwanda
[109] Sao Tome and Principe Saudi Arabia Senegal
[112] Serbia Sierra Leone Singapore
[115] Slovak Republic Slovenia Somalia
[118] South Africa Spain Sri Lanka
[121] Sudan Swaziland Sweden
[124] Switzerland Syria Taiwan
[127] Tanzania Thailand Togo
[130] Trinidad and Tobago Tunisia Turkey
[133] Uganda United Kingdom United States
[136] Uruguay Venezuela Vietnam
[139] West Bank and Gaza Yemen, Rep. Zambia
[142] Zimbabwe
142 Levels: Afghanistan Albania Algeria Angola Argentina Australia Austria ... Zimbabwe
df %>% filter(country %in% c("Brazil", "Russia", "India", "China")) %>%
ggplot(aes(x = year, y = lifeExp, color = country)) + geom_line()
Russian data is missing.

Change lifeExp to pop and gdpPercap and do the same.

group_by and summarize

Let us use the variable continent and summarize the data.
df_lifeExp <- df %>% group_by(continent, year) %>%
summarize(mean_lifeExp = mean(lifeExp), median_lifeExp = median(lifeExp), max_lifeExp = max(lifeExp), min_lifeExp = min(lifeExp), .groups = "keep")
df_lifeExp
df %>% filter(year %in% c(1952, 1987, 2007)) %>%
ggplot(aes(x=as_factor(year), y = lifeExp, fill = continent)) +
geom_boxplot()
df_lifeExp %>% ggplot(aes(x = year, y = mean_lifeExp, color = continent)) +
geom_line()
df_lifeExp %>% ggplot(aes(x = year, y = mean_lifeExp, color = continent, linetype = continent)) +
geom_line()
df_lifeExp %>% ggplot() +
geom_line(aes(x = year, y = mean_lifeExp, color = continent)) +
geom_line(aes(x = year, y = median_lifeExp, linetype = continent))
R Markdown and dplyr

- Name your files a2_123456.Rmd and a2_123456.nb.html, and submit a2_123456.nb.html to Moodle.
- Pick data from the built-in datasets besides cars (library(help = "datasets"), or go to the site The R Datasets Package).
- Try head(), str(), …, and create at least one chart using ggplot2 in a code chunk.
- Load the packages with library(tidyverse) in the first code chunk. Load gapminder by library(gapminder).
- Study pop or gdpPercap, or both, for one country in the data or a group of countries in the data. (We studied lifeExp.)

Due: 2023-01-09 23:59:00. Submit your R Notebook file in Moodle (The Second Assignment). Due on Monday!
gapminder
df_wdi <- WDI(
country = "all",
indicator = c(lifeExp = "SP.DYN.LE00.IN", pop = "SP.POP.TOTL", gdpPercap = "NY.GDP.PCAP.KD")
)
df_wdi
df_wdi_extra <- WDI(
country = "all",
indicator = c(lifeExp = "SP.DYN.LE00.IN", pop = "SP.POP.TOTL", gdpPercap = "NY.GDP.PCAP.KD"),
extra = TRUE
)
df_wdi_extra
library(tidyverse)
library(gapminder)
library(maps)
library(WDI)
library(readxl)
library(ggrepel)
You have installed tidyverse and gapminder already. If you have not installed WDI, install it. The same applies to ggrepel: if you want to use it, install it. maps and readxl come with the tidyverse installation but need to be attached with library().

df <- gapminder
df
gdpPercap of ASEAN countries

asean <- c("Brunei", "Cambodia", "Laos", "Myanmar",
           "Philippines", "Indonesia", "Malaysia", "Singapore")
df %>% filter(country %in% asean) %>%
ggplot(aes(x = year, y = gdpPercap, col = country)) + geom_line()
df %>% filter(country %in% asean) %>%
ggplot(aes(x = gdpPercap, y = lifeExp, col = country)) + geom_point()
df %>% filter(country %in% asean) %>%
ggplot(aes(x = gdpPercap, y = lifeExp, col = country)) +
geom_point() + coord_trans(x = "log10", y = "identity")
\(\log_{10}{100}\) = 2, \(\log_{10}{1000}\) = 3, \(\log_{10}{10000}\) = 4
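You can check these base-10 logarithm values directly in R:

```r
# log10 compresses large ranges: equal ratios become equal distances
log10(c(100, 1000, 10000))
# 2 3 4
```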
library(ggrepel)
df2007 <- df %>% filter(country %in% asean, year == 2007)
df %>% filter(country %in% asean) %>%
ggplot(aes(x = gdpPercap, y = lifeExp, col = country))+
geom_line() + geom_label_repel(data = df2007, aes(label = country)) + geom_point() +
coord_trans(x = "log10", y = "identity") +
theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust=1), legend.position = "none") +
labs(title = "Life Expectancy vs GDP Per Capita of ASEAN Countries",
subtitle = "Data: gapminder package", x = "GDP per Capita", y = "Life Expectancy")
df_wdi <- WDI(
country = "all",
indicator = c(lifeExp = "SP.DYN.LE00.IN", pop = "SP.POP.TOTL", gdpPercap = "NY.GDP.PCAP.KD")
)
df_wdi
df_wdi_extra <- WDI(
country = "all",
indicator = c(lifeExp = "SP.DYN.LE00.IN", pop = "SP.POP.TOTL", gdpPercap = "NY.GDP.PCAP.KD"),
extra = TRUE
)
df_wdi_extra
EDA is an iterative cycle that helps you understand what your data says. When you do EDA, you:
Generate questions about your data
Search for answers by visualising, transforming, and/or modeling your data
Use what you learn to refine your questions and/or generate new questions
EDA is an important part of any data analysis. You can use EDA to make discoveries about the world; or you can use EDA to ensure the quality of your data, asking questions about whether the data meets your standards or not.
The term Open Data has a very precise meaning. Data or content is open if anyone is free to use, re-use or redistribute it, subject at most to measures that preserve provenance and openness.
WDI(country = "all",
indicator = "NY.GDP.PCAP.KD",
start = 1960,
end = 2020,
extra = FALSE,
cache = NULL)
c('women_private_sector' = 'BI.PWK.PRVS.FE.ZS')

library(WDI)
WDIsearch(string = "NY.GDP.PCAP.KD",
field = "indicator", cache = NULL)
WDIsearch(string = "population",
field = "name", short=FALSE, cache = NULL)
WDIsearch(string = "NY.GDP.PCAP.KD",
field = "indicator", short = FALSE, cache = NULL)
WDIsearch(string = "gdp",
field = "name", short = TRUE, cache = NULL)
WDIbulk downloads the zip file from the Bulk Download section of the WDI site; the result is a list containing six data frames: Data, Country, Series, Country-Series, Series-Time, FootNote.
Download an updated list of available WDI indicators from the World Bank website. Returns a list for use in the WDIsearch function.
wdi_cache <- WDIcache()
Downloading all series information from the World Bank website can
take time. The WDI package ships with a local data object with
information on all the series available on 2012-06-18. You can update
this database by retrieving a new list using WDIcache, and
then feeding the resulting object to WDIsearch via the
cache argument.
wdi_cache
List of 2 data frames
The first character matrix includes a full list of WDI series. This list is updated semi-regularly. Users can refresh the list manually using the ‘WDIcache()’ function and search in the updated list using the ‘cache’ argument.
WDI_data$country %>% filter(country == "Japan")
WDIsearch(string = "gdp",
field = "name", short = FALSE, cache = NULL) #cache = wdi_cache
Find indicators:
WDIsearch(string = "gdp", field = "name", short = FALSE, cache = NULL)WDIsearch(string = "gdp", field = "name", short = FALSE, cache = wdi_cache)WDIsearch(string = "NY.GDP.PCAP.KD", field = "indicator", short = FALSE, cache = NULL)WDIsearch(string = "EN.ATM.CO2E.PC", field = "indicator",
short = FALSE, cache = NULL) #cache = wdi_cache
WDIsearch(string = "EN.ATM.CO2E.PC", field = "indicator",
short = FALSE, cache = NULL) %>% pull(description) #cache = wdi_cache
[1] "Carbon dioxide emissions are those stemming from the burning of fossil fuels and the manufacture of cement. They include carbon dioxide produced during consumption of solid, liquid, and gas fuels and gas flaring."
co2pcap <- WDI(country = "all", indicator = "EN.ATM.CO2E.PC", start = 1960, end = NULL, extra = TRUE, cache = NULL) #cache = wdi_cache
co2pcap
write_csv(co2pcap, "data/co2pcap.csv")
co2pcap %>% filter(country %in% c("World", "Japan", "United States", "China")) %>%
ggplot(aes(x = year, y = EN.ATM.CO2E.PC, color = country)) +
geom_line()
co2pcap %>% filter(!is.na(EN.ATM.CO2E.PC)) %>% pull(year) %>% summary()
co2pcap %>%
filter(country %in% c("World", "Japan", "United States", "China"), year %in% 1990:2019) %>%
ggplot(aes(x = year, y = EN.ATM.CO2E.PC, color = country)) +
geom_line()
co2pcap %>%
filter(income != "Aggregates", year == 2019) %>%
ggplot(aes(x = income, y = EN.ATM.CO2E.PC, fill = income)) +
geom_boxplot()
co2pcap %>%
filter(income != "Aggregates", year == 2019, !is.na(EN.ATM.CO2E.PC)) %>%
ggplot(aes(x = income, y = EN.ATM.CO2E.PC, fill = income)) +
geom_boxplot()
boxplot: https://vimeo.com/222358034

co2pcap %>%
filter(income != "Aggregates", year == 2019, !is.na(EN.ATM.CO2E.PC)) %>%
group_by(income) %>%
summarize(min = min(EN.ATM.CO2E.PC), med = median(EN.ATM.CO2E.PC), max = max(EN.ATM.CO2E.PC), IQR = IQR(EN.ATM.CO2E.PC), n = n())
co2pcap %>%
filter(income != "Aggregates", year == 2019, !is.na(EN.ATM.CO2E.PC)) %>%
filter(!income %in% c("High income", "Low income", "Lower middle income", "Upper middle income"))
co2pcap %>%
filter(income != "Aggregates", year == 2019) %>%
filter(income == "Not classified")
world_map <- map_data("world") # world polygons from the maps package
co2pcap %>%
  filter(income != "Aggregates", year == 2019) %>%
  ggplot(aes(map_id = country)) +
  geom_map(aes(fill = income), map = world_map) +
  expand_limits(x = world_map$long, y = world_map$lat) +
  labs(title = "Income Levels in 2019")
co2pcap %>% distinct(country)
world_map %>% distinct(region)
world_map0 <- world_map %>%
mutate(region = case_when(region == "Macedonia" ~ "North Macedonia",
region == "Ivory Coast" ~ "Cote d'Ivoire",
region == "Democratic Republic of the Congo" ~ "Congo, Dem. Rep.",
region == "Republic of Congo" ~ "Congo, Rep.",
region == "UK" ~ "United Kingdom",
region == "USA" ~ "United States",
region == "Laos" ~ "Lao PDR",
region == "Slovakia" ~ "Slovak Republic",
region == "Saint Lucia" ~ "St. Lucia",
region == "Kyrgyzstan" ~ "Kyrgyz Republic",
region == "Micronesia" ~ "Micronesia, Fed. Sts.",
region == "Swaziland" ~ "Eswatini",
region == "Virgin Islands" ~ "Virgin Islands (U.S.)",
region == "Russia" ~ "Russian Federation",
region == "Egypt" ~ "Egypt, Arab Rep.",
region == "South Korea" ~ "Korea, Rep.",
region == "North Korea" ~ "Korea, Dem. People's Rep.",
region == "Iran" ~ "Iran, Islamic Rep.",
region == "Brunei" ~ "Brunei Darussalam",
region == "Venezuela" ~ "Venezuela, RB",
region == "Yemen" ~ "Yemen, Rep.",
region == "Bahamas" ~ "Bahamas, The",
region == "Syria" ~ "Syrian Arab Republic",
region == "Turkey" ~ "Turkiye",
region == "Cape Verde" ~ "Cabo Verde",
region == "Gambia" ~ "Gambia, The",
region == "Czech Republic" ~ "Czechia",
TRUE ~ region))
write_csv(world_map0, "data/world_map0.csv")
map0_url <- "https://icu-hsuzuki.github.io/da4r2022_note/data/world_map0.csv"
world_map0 <- read_csv(map0_url)
co2pcap %>% filter(income != "Aggregates", year == 2019) %>%
anti_join(world_map0, by = c("country"="region"))
world_map0 %>% anti_join(co2pcap, by = c("region"="country")) %>% distinct(region) %>% arrange(region)
world_map0 %>% left_join(iso3166, by = c("region" = "ISOname")) %>%
filter(is.na(a2)) %>% distinct(region)
EDA is an iterative cycle that helps you understand what your data says. When you do EDA, you:
Generate questions about your data
Search for answers by visualising, transforming, and/or modeling your data
Use what you learn to refine your questions and/or generate new questions
EDA is an important part of any data analysis. You can use EDA to make discoveries about the world; or you can use EDA to ensure the quality of your data, asking questions about whether the data meets your standards or not.
There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:
The rest of this tutorial will look at these two questions. To make the discussion easier, let’s define some terms…
ggplot2 Basics: Visualization
library(readxl)
url_summary <- "https://wir2022.wid.world/www-site/uploads/2022/03/WIR2022TablesFigures-Summary.xlsx"
download.file(url = url_summary, destfile = "data/WIR2022s.xlsx")
excel_sheets("data/WIR2022s.xlsx")
Note that the sheet name of F14 has a period at the end.
df_f14 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F14.")
df_f14
Use \n for a line break in the title.

df_f14 %>%
ggplot(aes(x = Group, y = Share)) +
geom_col()
df_f14 %>%
ggplot(aes(x = Group, y = Share)) +
geom_col(width = 0.5, fill = scales::hue_pal()(1)[1]) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(title = "Figure 14. Global carbon inequality, \n2019 Group contribution to world emissions (%)",
x = "", y = "Share of world emissions (%)")
width = 0.5: width of bars
fill = scales::hue_pal()(1)[1]: the first color of the default hue scale
scale_y_continuous(labels = scales::percent_format(accuracy = 1)):
percent format
labs(title = "Figure 14. Global carbon inequality, \n2019 Group contribution to world emissions (%)", x = "", y = "Share of world emissions (%)")
\n is for a line feed.

df_f1 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F1")
df_f1
df_f1_rev %>% # df_f1_rev is the tidied (long) form of df_f1, created with pivot_longer in the tidyr section
ggplot(aes(x = cat, y = value, fill = group)) +
geom_col(position = "dodge")
ggplot2: Visualize Data
WDI and ggplot2
Create an R Notebook (a3_123456.nb.html, replacing 123456 with your ID).
Submit a3_123456.Rmd and a3_123456.nb.html to Moodle.
Choose at least one indicator of WDI and download it with WDI.
Explore the data with visualizations using ggplot2.
Observations and difficulties encountered.
Due: 2023-01-16 23:59:00. Submit your R Notebook file in Moodle (The Third Assignment). Due on Monday!
library(tidyverse)
library(readxl)
url_summary <- "https://wir2022.wid.world/www-site/uploads/2022/03/WIR2022TablesFigures-Summary.xlsx"
download.file(url = url_summary, destfile = "data/WIR2022s.xlsx")
excel_sheets("data/WIR2022s.xlsx")
df_f1 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F1")
df_f1
df_f1_rev %>%
ggplot(aes(x = cat, y = value, fill = group)) +
geom_col(position = "dodge")
tidyr: Tidy Your Data
“Data comes in many formats, but R prefers just one: tidy data.” — Garrett Grolemund
Data can come in a variety of formats, but one format is easier to use in R than the others. This format is known as tidy data. A data set is tidy if:
“Tidy data sets are all alike; but every messy data set is messy in its own way.” — Hadley Wickham
“All happy families are alike; every unhappy family is unhappy in its own way.” — Tolstoy, Anna Karenina
tidyr
Basics: pivot_longer()

pivot_longer(data, cols = <columns to pivot into longer format>,
names_to = <name of the new character column>, # e.g. "group", "category", "class"
values_to = <name of the column the values of cells go to>) # e.g. "value", "n"
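A minimal sketch of this template, using a made-up wide table (the names and numbers are for illustration only):

```r
library(tidyr)
library(dplyr)

# a made-up wide table: one column per year
wide <- tibble(
  country = c("A", "B"),
  `2020`  = c(1, 3),
  `2021`  = c(2, 4)
)

# pivot the year columns into rows: one (country, year, value) per row
long <- wide %>%
  pivot_longer(-country, names_to = "year", values_to = "value")
long  # 4 rows with columns country, year, value
```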
df_f1
(df_f1_rev <- df_f1 %>% pivot_longer(-1, names_to = "group", values_to = "value"))
df_f1_rev %>%
ggplot(aes(x = ...1, y = value, fill = group)) +
geom_col(position = "dodge")
df_f1_rev %>% filter(group != "Top 1%") %>%
ggplot() +
geom_col(aes(x = ...1, y = value, fill = group), position = "dodge") +
geom_text(aes(x = ...1, y = value, group = group,
label = scales::label_percent(accuracy=1)(value)),
position = position_dodge(width = 0.9)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(title = "Figure 1. Global income and wealth inequality, 2021",
x = "", y = "Share of total income or wealth", fill = "")
Interpretation: The global bottom 50% captures 8.5%
of total income measured at Purchasing Power Parity (PPP). The global
bottom 50% owns 2% of wealth (at Purchasing Power Parity). The global
top 10% owns 76% of total Household wealth and captures 52% of total
income in 2021. Note that top wealth holders are not necessarily top
income holders. Incomes are measured after the operation of pension and
unemployment systems and before taxes and transfers.
Sources and series: wir2022.wid.world/methodology.
pivot_wider()
In Console: vignette("pivot")
pivot_wider(data,
names_from = <name of the column (or columns) to get the name of the output column>,
values_from = <name of the column to get the value of the output>)
pivot_wider(data, names_from = group, values_from = value)
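Conversely, a minimal pivot_wider() sketch with made-up long data (names are for illustration only):

```r
library(tidyr)
library(dplyr)

# made-up long data: one row per (country, year)
long <- tibble(
  country = rep(c("A", "B"), each = 2),
  year    = rep(c(2020, 2021), times = 2),
  value   = 1:4
)

# spread the years back out into columns
wide <- long %>%
  pivot_wider(names_from = year, values_from = value)
wide  # one row per country, one column per year
```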
F4 and F13 are similar. Please use pivot_longer to tidy
the data and create charts.
df_f3 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F3")
df_f3
df_f3$T10B50 %>% summary()
df_f3 %>% ggplot() + geom_histogram(aes(T10B50))
df_f3 %>% arrange(desc(T10B50))
df_f3 %>%
mutate(`Top 10 Bottom 50 Ratio` = cut(T10B50,breaks = c(5, 12, 13, 16, 19,140),
include.lowest = FALSE))
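The cut() call above bins the T10B50 ratio into intervals; a minimal base R illustration with made-up values:

```r
# cut() bins a numeric vector into intervals defined by breaks
x <- c(6, 12.5, 15, 20, 100)
cut(x, breaks = c(5, 12, 13, 16, 19, 140))
# bins: (5,12] (12,13] (13,16] (19,140] (19,140]
```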
world_map <- map_data("world")
df_f3 %>% mutate(`Top 10 Bottom 50 Ratio` = cut(T10B50,breaks = c(5, 12, 13, 16, 19,140),
include.lowest = FALSE)) %>%
ggplot(aes(map_id = Country)) +
geom_map(aes(fill = `Top 10 Bottom 50 Ratio`), map = world_map) +
expand_limits(x = world_map$long, y = world_map$lat)
world_map_wir <- world_map
world_map_wir$region[
world_map_wir$region=="Democratic Republic of the Congo"]<-"DR Congo"
world_map_wir$region[world_map_wir$region=="Republic of Congo"]<-"Congo"
world_map_wir$region[world_map_wir$region=="Ivory Coast"]<-"Cote dIvoire"
world_map_wir$region[world_map_wir$region=="Vietnam"]<-"Viet Nam"
world_map_wir$region[world_map_wir$region=="Russia"]<-"Russian Federation"
world_map_wir$region[world_map_wir$region=="South Korea"]<-"Korea"
world_map_wir$region[world_map_wir$region=="UK"]<-"United Kingdom"
world_map_wir$region[world_map_wir$region=="Brunei"]<-"Brunei Darussalam"
world_map_wir$region[world_map_wir$region=="Laos"]<-"Lao PDR"
world_map_wir$region[world_map_wir$region=="Cote dIvoire"]<-"Cote d'Ivoire"
world_map_wir$region[world_map_wir$region=="Cape Verde"]<- "Cabo Verde"
world_map_wir$region[world_map_wir$region=="Syria"]<- "Syrian Arab Republic"
world_map_wir$region[world_map_wir$region=="Trinidad"]<- "Trinidad and Tobago"
world_map_wir$region[world_map_wir$region=="Tobago"]<- "Trinidad and Tobago"
df_f3 %>% mutate(`Top 10 Bottom 50 Ratio` =
cut(T10B50, breaks = c(5, 12, 13, 16, 19,140), include.lowest = FALSE)) %>%
ggplot(aes(map_id = Country)) +
geom_map(aes(fill = `Top 10 Bottom 50 Ratio`),
map = world_map_wir) +
expand_limits(x = world_map_wir$long, y = world_map_wir$lat)
df_f3 %>% mutate(`Top 10 Bottom 50 Ratio` =
cut(T10B50,breaks = c(5, 12, 13, 16, 19,140), include.lowest = FALSE)) %>%
ggplot(aes(map_id = Country)) + geom_map(aes(fill = `Top 10 Bottom 50 Ratio`),
map = world_map_wir) + expand_limits(x = world_map_wir$long, y = world_map_wir$lat) +
coord_map("orthographic", orientation = c(25, 60, 0))
df_f3 %>% mutate(`Top 10 Bottom 50 Ratio` =
cut(T10B50,breaks = c(5, 12, 13, 16, 19,140), include.lowest = FALSE)) %>%
ggplot(aes(map_id = Country)) + geom_map(aes(fill = `Top 10 Bottom 50 Ratio`),
map = world_map_wir) + expand_limits(x = world_map_wir$long, y = world_map_wir$lat) +
coord_map("orthographic", orientation = c(15, -80, 0))
df_f3 %>% mutate(`Top 10 Bottom 50 Ratio` =
cut(T10B50,breaks = c(5, 12, 13, 16, 19,140), include.lowest = FALSE)) %>%
ggplot(aes(map_id = Country)) + geom_map(aes(fill = `Top 10 Bottom 50 Ratio`),
map = world_map_wir) +
expand_limits(x = world_map_wir$long, y = world_map_wir$lat)
df_f3 %>%
mutate(`Top 10 Bottom 50 Ratio` =
cut(T10B50,breaks = c(5, 12, 13, 16, 19,140), include.lowest = FALSE)) %>%
ggplot(aes(map_id = Country)) +
geom_map(aes(fill = `Top 10 Bottom 50 Ratio`), map = world_map_wir) +
expand_limits(x = world_map_wir$long, y = world_map_wir$lat) +
labs(title = "Figure 3. Top 10/Bottom 50 income gaps across the world, 2021",
x = "", y = "", fill = "Top 10/Bottom 50 ratio") +
theme(legend.position="bottom",
axis.text.x=element_blank(), axis.ticks.x=element_blank(),
axis.text.y=element_blank(), axis.ticks.y=element_blank()) +
scale_fill_brewer(palette='YlOrRd')
df_f3 %>% anti_join(world_map_wir, by = c("Country" = "region"))
Filtering joins
anti_join(x, y, ...): return all rows from x without a match in y.
semi_join(x, y, ...): return all rows from x with a match in y.
Check the dplyr cheat sheet and Posit Primers: Tidy Data.
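A minimal sketch of the two filtering joins with made-up keys:

```r
library(dplyr)

x <- tibble(key = c("a", "b", "c"))
y <- tibble(key = c("b", "c", "d"))

semi_join(x, y, by = "key")  # rows of x with a match in y: b, c
anti_join(x, y, by = "key")  # rows of x without a match in y: a
```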
F5: Global income inequality: T10/B50 ratio, 1820-2020 - fit curve
F9: Average annual wealth growth rate, 1995-2021 - fit curve + alpha
F7: Global income inequality, 1820-2020 - pivot + fit curve
F10: The share of wealth owned by the global 0.1% and billionaires, 2021 - pivot + fit curve
F6: Global income inequality: Between vs. Within country inequality (Theil index), 1820-2020 - pivot + area
F11: Top 1% vs bottom 50% wealth shares in Western Europe and the US, 1910-2020 - pivot name_sep + fit curve
F8: The rise of private versus the decline of public wealth in rich countries, 1970-2020 - rename + pivot + pivot + fit curve
F15: Per capita emissions across the world, 2019 - add row names + dodge
(df_f5 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F5"))
df_f5 %>% ggplot(aes(x = y, y = t10b50)) + geom_line() + geom_smooth(span=0.25, se=FALSE)
df_f9 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F9"); df_f9
df_f9 %>%
ggplot(aes(x = p, y = `Wealth growth 1995-2021`)) + geom_smooth(span = 0.30, se = FALSE)
df_f7 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F7"); df_f7
df_f7 %>%
pivot_longer(cols = 2:4, names_to = "type", values_to = "value") %>%
ggplot(aes(x = y, y = value, color = type)) +
stat_smooth(formula = y~x, method = "loess", span = 0.25, se = FALSE)
df_f6 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F6"); df_f6
df_f6 %>% select(year = "...1", 2:3) %>%
pivot_longer(cols = 2:3, names_to = "type", values_to = "value") %>%
mutate(types = factor(type,
levels = c("Within-country inequality", "Between-country inequality"))) %>%
ggplot(aes(x = year, y = value, fill = types)) +
geom_area() +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
scale_x_continuous(breaks = round(seq(1820, 2020, by = 20),1)) +
scale_fill_manual(values=rev(scales::hue_pal()(2)),
labels = function(x) str_wrap(x, width = 15)) +
labs(title = "Figure 6. Global income inequality:
\nBetween vs. within country inequality (Theil index), 1820-2020",
x = "", y = "Share of global inequality (% of total Theil index)", fill = "") +
annotate("text", x = 1850, y = 0.28,
label = stringr::str_wrap("1820: Between country inequality represents 11%
of global inequality", width = 20), size = 3) +
annotate("text", x = 1980, y = 0.70,
label = stringr::str_wrap("1980: Between country inequality represents 57%
of global inequality", width = 20), size = 3) +
annotate("text", x = 1990, y = 0.30,
label = stringr::str_wrap("2020: Between country inequality represents 32%
of global inequality", width = 20), size = 3)
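The str_wrap() helper used above for the legend labels and annotations folds a long string at word boundaries:

```r
library(stringr)

# fold a long label into lines of at most ~15 characters
str_wrap("Between-country inequality", width = 15)
```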
df_f8 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F8"); df_f8
df_f8 %>%
select(year, Germany_public = Germany, Germany_private = 'Germany (private)',
Spain_public = Spain, Spain_private = 'Spain (private)',
France_public = France, France_private = 'France (private)',
UK_public = UK, UK_private = 'UK (private)',
Japan_public = Japan, Japan_private = 'Japan (private)',
Norway_public = Norway, Norway_private = 'Norway (private)',
USA_public = USA, USA_private = 'USA (private)') %>%
pivot_longer(!year, names_to = c("country",".value"), names_sep = "_") %>%
pivot_longer(3:4, names_to = "type", values_to = "value") %>%
ggplot() +
stat_smooth(aes(x = year, y = value, color = country, linetype = type),
span = 0.25, se = FALSE, size=0.75) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(title = "Figure 8. The rise of private versus the decline of public
wealth in rich countries, 1970-2020",
x = "", y = "wealth as % of national income", color = "", linetype = "")
df_f8 %>%
select(year, Germany_public = Germany, Germany_private = 'Germany (private)',
Spain_public = Spain, Spain_private = 'Spain (private)',
France_public = France, France_private = 'France (private)',
UK_public = UK, UK_private = 'UK (private)',
Japan_public = Japan, Japan_private = 'Japan (private)',
Norway_public = Norway, Norway_private = 'Norway (private)',
USA_public = USA, USA_private = 'USA (private)')
df_f8 %>%
select(year, Germany_public = Germany, Germany_private = 'Germany (private)',
Spain_public = Spain, Spain_private = 'Spain (private)',
France_public = France, France_private = 'France (private)',
UK_public = UK, UK_private = 'UK (private)',
Japan_public = Japan, Japan_private = 'Japan (private)',
Norway_public = Norway, Norway_private = 'Norway (private)',
USA_public = USA, USA_private = 'USA (private)') %>%
pivot_longer(!year, names_to = c("country",".value"), names_sep = "_")
df_f8 %>%
select(year, Germany_public = Germany, Germany_private = 'Germany (private)',
Spain_public = Spain, Spain_private = 'Spain (private)',
France_public = France, France_private = 'France (private)',
UK_public = UK, UK_private = 'UK (private)',
Japan_public = Japan, Japan_private = 'Japan (private)',
Norway_public = Norway, Norway_private = 'Norway (private)',
USA_public = USA, USA_private = 'USA (private)') %>%
pivot_longer(!year, names_to = c("country",".value"), names_sep = "_") %>%
pivot_longer(3:4, names_to = "type", values_to = "value")
df_f8 %>%
select(year, Germany_public = Germany, Germany_private = 'Germany (private)',
Spain_public = Spain, Spain_private = 'Spain (private)',
France_public = France, France_private = 'France (private)',
UK_public = UK, UK_private = 'UK (private)',
Japan_public = Japan, Japan_private = 'Japan (private)',
Norway_public = Norway, Norway_private = 'Norway (private)',
USA_public = USA, USA_private = 'USA (private)') %>%
pivot_longer(!year, names_to = c("country",".value"), names_sep = "_") %>%
pivot_longer(3:4, names_to = "type", values_to = "value") %>%
ggplot() +
stat_smooth(aes(x = year, y = value, color = country, linetype = type),
formula = y~x, method = "loess", span = 0.25, se = FALSE, size=0.75) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(title = "Figure 8. The rise of private versus the decline of public wealth
\nin rich countries, 1970-2020",
x = "", y = "wealth as % of national income", color = "", linetype = "")
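The two-step pivot above hinges on the special ".value" sentinel in names_to: the part of the column name after the separator becomes a new column rather than a cell value. A minimal sketch with made-up numbers in the same shape as the renamed df_f8:

```r
library(tidyr)
library(dplyr)

# made-up wide table: country_type column names, one row per year
wide <- tibble(
  year          = 2020,
  Japan_public  = 0.2, Japan_private = 5.0,
  USA_public    = -0.1, USA_private  = 5.6
)

wide %>%
  pivot_longer(!year,
               names_to = c("country", ".value"),
               names_sep = "_")
# one row per country; the suffixes become columns `public` and `private`
```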
df_f15 <- read_excel("data/WIR2022s.xlsx", sheet = "data-F15"); df_f15
df_f15 %>% mutate(region = rep(regionWID[!is.na(regionWID)], each = 3)) %>%
select(region, group, tcap) %>%
ggplot(aes(x = region, y = tcap, fill = group)) +
geom_col(position = "dodge") +
scale_x_discrete(labels = function(x) stringr::str_wrap(x, width = 10)) +
labs(title = "Figure 15 Per capita emissions across the world, 2019",
x = "", y = "tonnes of CO2e per person per year", fill = "")
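The rep(..., each = 3) call above repeats each region name once per emissions group; in base R:

```r
# each = repeats each element in place before moving to the next
rep(c("Africa", "Asia"), each = 3)
# "Africa" "Africa" "Africa" "Asia" "Asia" "Asia"
```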
Repeat the process during your EDA.
In RStudio,
1.1. Project: project_name.Rproj is created in your project folder (directory).
1.2. Create a data folder (directory) named data.
1.3. Move (or copy) the data for the project into the data folder.
To check the contents of data: press Files in the bottom-right pane and click data, the data folder.
2.1. Project Notebook: Memo
Create an R Notebook: File > New File > R Notebook
Add descriptive title.
2.2. Setup Code Chunk
Create a code chunk and add packages to use in the project and RUN the code.
2.3. Choose Source or Visual editor mode, and start editing the Project Notebook.
Set up Headings such as: About, Data, Analysis and Visualizations, Conclusions
Under About or Data, paste url of the sites and/or the data
2.4. Edit a new file by saving as for a report
Assign a name you can recall easily when you import data. You may need to reload the data with options.
3.1. Use a package:
write_csv(wdi_shortname, "data/wdi_shortname.csv")
wdi_shortname <- read_csv("data/wdi_shortname.csv")
3.2. Use readr to read from data, your data folder:
df1_shortname <- read_csv("data/file_name.csv")
3.3. Use readr to read using the url of the data:
df2_shortname <- read_csv("url_of_the_data")
write_csv(df2_shortname, "data/df2_shortname.csv")
df2_shortname <- read_csv("data/df2_shortname.csv")
3.5. Use readxl to read Excel data. Add library(readxl) in the setup and run:
df4 <- read_excel("data/file_name.xlsx", sheet = 1)
References: Cheat Sheet - readr, readr, readxl
4.1. Look at the data: suppose df is the data frame
dt <- as_tibble(df)
head(df), str(df), summary(df), dt, glimpse(dt)
4.2. Look at each variable
4.3. Variation of each variable: suppose x1 is a column name.
df %>% ggplot() + geom_histogram(aes(x1), bins = 30)
df %>% drop_na(x1): see the rows with a value in x1. If the value is NA, the row is not shown.
df_wo_na <- df %>% drop_na(x1) if you want to use only the rows without NA in x1.
4.4. Use dplyr and tidyr to change column names, tidy data, and/or summarize data:
rename, select, filter, arrange, mutate, pivot_longer(), pivot_wider(), group_by and summarize
References: Cheat Sheet - dplyr and tidyr, dplyr, tidyr
5.1. In combination with Step 4 - data transformation, try various data visualizations.
5.2. Keep a record of what you can observe by the visualization
5.3. Edit the list of questions by adding or polishing
5.4. Select several informative charts and add options
5.5. Look at examples from the textbooks or teaching site to have better visualization
References: Cheat Sheet - ggplot2 ggplot2, ggplot2 book
Government expenditure on education, total (% of GDP)
ID: SE.XPD.TOTL.GD.ZS
tidyr and WIR2022
Create an R Notebook (a3_123456.nb.html, replacing 123456 with your ID).
Submit a3_123456.Rmd and a3_123456.nb.html to Moodle.
Choose a dataset with at least two categorical variables and at least two numerical variables.
Explore the data with visualizations using ggplot2.
Observations based on your data visualization, and difficulties and questions encountered if any.
Due: 2023-01-23 23:59:00. Submit your R Notebook file in Moodle (The Fourth Assignment). Due on Monday!
“Tidy data sets are all alike; but every messy data set is messy in its own way.” — Hadley Wickham
“All happy families are alike; every unhappy family is unhappy in its own way.” — Tolstoy, Anna Karenina
Correlation
iris %>% select(-5) %>% cor()
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
Galton’s data. Regression.
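As a minimal regression sketch (using the built-in cars data, not Galton's original height data):

```r
# simple linear regression: stopping distance as a function of speed
fit <- lm(dist ~ speed, data = cars)
coef(fit)                # intercept and slope
summary(fit)$r.squared   # proportion of variance explained
```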
Normalization, standardization
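A minimal base R sketch of standardization (z-scores) and min-max normalization, with made-up values:

```r
x <- c(2, 4, 6, 8)

# standardization (z-scores): subtract the mean, divide by the sd
z <- (x - mean(x)) / sd(x)

# min-max normalization: rescale to the interval [0, 1]
n <- (x - min(x)) / (max(x) - min(x))

round(mean(z), 10)  # 0
range(n)            # 0 1
```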
SDGs Academy
Trend
df4 <- WDI(
country="all",
indicator = c("SP.POP.TOTL","MS.MIL.XPND.GD.ZS","MS.MIL.TOTL.P1","NY.GDP.MKTP.CD"),
start = 1960,
end = NULL,
extra = FALSE,
cache = NULL,
latest = NULL,
language = "en"
)
colnames(df4) <- c("Country","iso2c","iso3c","Year","Population","Mil_Exp","Total_Mil_HC","GDP")
df4
df_wdi_poverty <- WDI(
  country = "all",
  indicator = c(national_poverty_rate = "SI.POV.NAHC",
                multidimentional_poverty_rate = "SI.POV.MDIM",
                gdpPercap = "NY.GDP.PCAP.KD",
                gini_indx = "SI.POV.GINI"),
  start = 1990,
  end = 2021,
  extra = TRUE
)
df_wdi_poverty %>%
  group_by(country) %>% # average over the years within each country
  mutate(mean_gdp_country = mean(gdpPercap, na.rm = TRUE)) %>%
  mutate(mean_poverty_country = mean(national_poverty_rate, na.rm = TRUE)) %>%
  ungroup() %>%
  filter(!is.na(income)) %>%
  filter(!(income == "Aggregates")) %>%
  ggplot(aes(x = log10(mean_gdp_country))) +
  geom_point(aes(y = mean_poverty_country, color = income)) +
  labs(x = "GDP per capita (log10)", y = "poverty rate (% of population)",
       title = "Poverty rates and GDP per capita",
       subtitle = "world countries, 1990-2021 average, by income level")
df_wdi_poverty %>%
  group_by(country) %>% # average over the years within each country
  mutate(mean_gdp_country = mean(gdpPercap, na.rm = TRUE)) %>%
  mutate(mean_multipoverty_country = mean(multidimentional_poverty_rate, na.rm = TRUE)) %>%
  ungroup() %>%
  filter(!(region == "Aggregates")) %>%
  filter(!is.na(region)) %>%
  ggplot(aes(x = log10(mean_gdp_country))) +
  geom_point(aes(y = mean_multipoverty_country, color = region)) +
  labs(x = "GDP per capita (log10)", y = "Multidimensional poverty rate (% of population)",
       title = "Multidimensional Poverty rates and GDP per capita",
       subtitle = "world countries, 1990-2021 average, by region")
index sdg, ghi, academy
moocs
R4DS: Model basics https://r4ds.had.co.nz/model-basics.html modelr: https://modelr.tidyverse.org
Tidymodels: https://www.tidymodels.org
Tidyverse Skills for Data Science https://jhudatascience.org/tidyversecourse/ https://jhudatascience.org/tidyversecourse/model.html
Machine learning: A Gentle Introduction to tidymodels https://rviews.rstudio.com/2019/06/19/a-gentle-intro-to-tidymodels/
Tidy Modeling with R https://www.tmwr.org
Importing Data:
read_csv("data/file_name.csv")
read_csv(url)
readxl::read_excel("data/excel_file_name.xlsx")
read_delim(clipboard())
https://yihui.org/knitr/options/
Data Source
Variables
Problems
Visualization
Model
Conclusions and Further Research
WDI, WIR, etc
Custom Word templates: https://bookdown.org/yihui/rmarkdown-cookbook/word-template.html
You can apply the styles defined in a Word template document to new Word documents generated from R Markdown. Such a template document is also called a “style reference document.” The key is that you have to create this template document from Pandoc first, and change the style definitions in it later. Then pass the path of this template to the reference_docx option of word_document.
---
output:
  word_document:
    reference_docx: "template.docx"
---
PowerPoint presentation: https://bookdown.org/yihui/rmarkdown/powerpoint-presentation.html
Custom templates: https://bookdown.org/yihui/rmarkdown/powerpoint-presentation.html#ppt-templates
---
output:
  powerpoint_presentation:
    reference_doc: my-styles.pptx
---
YouTube: How To Create A PowerPoint Template
1.6 Comments on Week 2
1.6.0.1 Helpful Resources
Cheat Sheet in RStudio: https://www.rstudio.com/resources/cheatsheets/
‘Quick R’ by DataCamp: https://www.statmethods.net/management
An Introduction to R
1.6.0.2 Practicum
1.6.0.3 Assignments - See Moodle